Dense Voxel 3D Reconstruction Using a Monocular Event Camera
Event cameras are sensors inspired by biological systems that specialize in
capturing changes in brightness. These emerging cameras offer many advantages
over conventional frame-based cameras, including high dynamic range, high frame
rates, and extremely low power consumption. Due to these advantages, event
cameras have increasingly been adopted in various fields, such as frame
interpolation, semantic segmentation, odometry, and SLAM. However, their
application in 3D reconstruction for VR applications is underexplored. Previous
methods in this field mainly focused on 3D reconstruction through depth map
estimation. Methods that produce dense 3D reconstruction generally require
multiple cameras, while methods that utilize a single event camera can only
produce a semi-dense result. Other single-camera methods that can produce dense
3D reconstruction rely on creating a pipeline that either incorporates the
aforementioned methods or other existing Structure from Motion (SfM) or
Multi-view Stereo (MVS) methods. In this paper, we propose a novel approach for
solving dense 3D reconstruction using only a single event camera. To the best
of our knowledge, our work is the first attempt in this regard. Our preliminary
results demonstrate that the proposed method can produce visually
distinguishable dense 3D reconstructions directly without requiring pipelines
like those used by existing methods. Additionally, we have created a synthetic
dataset with object scans using an event camera simulator. This
dataset will help accelerate other relevant research in this field.
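As background for readers unfamiliar with event data, the sketch below shows one common way to accumulate a raw event stream into a dense voxel grid (time bins x height x width). It is a minimal illustration in Python/NumPy, not the reconstruction method proposed above; the (t, x, y, polarity) event layout and the grid sizes are assumptions.

# Illustrative sketch only: accumulating an event stream into a voxel grid.
# NOT the paper's reconstruction method; event layout and sizes are assumptions.
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a (num_bins, height, width) voxel grid.

    events: float array of shape (N, 4) with columns (t, x, y, polarity),
            polarity in {-1, +1}.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    # Normalize timestamps to [0, num_bins - 1] and drop each event into its bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = np.clip(t_norm.astype(int), 0, num_bins - 1)
    np.add.at(grid, (bins, y, x), p)
    return grid

# Example: 1000 synthetic events on a 64x64 sensor, 5 temporal bins.
rng = np.random.default_rng(0)
ev = np.column_stack([
    np.sort(rng.uniform(0, 1, 1000)),   # timestamps
    rng.integers(0, 64, 1000),          # x
    rng.integers(0, 64, 1000),          # y
    rng.choice([-1.0, 1.0], 1000),      # polarity
])
print(events_to_voxel_grid(ev, 5, 64, 64).shape)  # (5, 64, 64)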
Fine-grained Activity Classification In Assembly Based On Multi-visual Modalities
Assembly activity recognition and prediction help to improve productivity, quality control, and safety measures in smart factories. This study aims to sense, recognize, and predict a worker's continuous fine-grained assembly activities in a manufacturing platform. We propose a two-stage network for workers' fine-grained activity classification by leveraging scene-level and temporal-level activity features. The first stage is a feature awareness block that extracts scene-level features from multi-visual modalities, including red, green, blue (RGB) and hand skeleton frames. We use the transfer learning method in the first stage and compare three different pre-trained feature extraction models. Then, we transmit the feature information from the first stage to the second stage to learn the temporal-level features of activities. The second stage consists of Recurrent Neural Network (RNN) layers and a final classifier. We compare the performance of two different RNNs in the second stage, including the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). The partial video observation method is used in the prediction of fine-grained activities. In the experiments using the trimmed activity videos, our model achieves an accuracy of > 99% on our dataset and > 98% on the public dataset UCF 101, outperforming the state-of-the-art models. The prediction model achieves an accuracy of > 97% in predicting activity labels using 50% of the onset activity video information. In the experiments using an untrimmed video with continuous assembly activities, we combine our recognition and prediction models and achieve an accuracy of > 91% in real time, surpassing the state-of-the-art models for the recognition of continuous assembly activities.
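A minimal sketch of the two-stage pattern described above (a pretrained per-frame feature extractor followed by an RNN and classifier), assuming PyTorch and torchvision. The ResNet-18 backbone, the GRU, and all layer sizes are illustrative assumptions, not the authors' exact configuration.

# Hedged sketch of a two-stage recognizer: per-frame CNN features -> RNN -> classifier.
import torch
import torch.nn as nn
from torchvision import models

class TwoStageActivityNet(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        # Stage 1: scene-level features per frame. In practice one would load
        # pretrained weights (transfer learning); weights=None keeps the sketch
        # self-contained.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                       # 512-d feature per frame
        self.backbone = backbone
        # Stage 2: temporal-level features over the frame sequence.
        self.rnn = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                             # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))        # (B*T, 512)
        feats = feats.view(b, t, -1)                      # (B, T, 512)
        _, h = self.rnn(feats)                            # final hidden state summarizes the clip
        return self.head(h[-1])                           # (B, num_classes)

# Example forward pass on a dummy 8-frame clip.
logits = TwoStageActivityNet(num_classes=10)(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 10])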
Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks
Deep learning models have achieved excellent recognition results on
large-scale video benchmarks. However, they perform poorly when applied to
videos with rare scenes or objects, primarily due to the bias of existing video
datasets. We tackle this problem from two different angles: algorithm and
dataset. From the perspective of algorithms, we propose Spatial-aware
Multi-Aspect Debiasing (SMAD), which incorporates both explicit debiasing with
multi-aspect adversarial training and implicit debiasing with the spatial
actionness reweighting module, to learn a more generic representation invariant
to non-action aspects. To neutralize the intrinsic dataset bias, we propose
OmniDebias to leverage web data for joint training selectively, which can
achieve higher performance with far fewer web data. To verify the
effectiveness, we establish evaluation protocols and perform extensive
experiments on both re-distributed splits of existing datasets and a new
evaluation dataset focusing on actions with rare scenes. We also show that
the debiased representation can generalize better when transferred to other
datasets and tasks. Comment: ECCVW 202
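For readers unfamiliar with explicit adversarial debiasing, the sketch below illustrates the generic gradient-reversal idea behind it, assuming PyTorch: an auxiliary head predicts a non-action aspect (e.g., the scene), and its reversed gradient pushes the shared features to become invariant to that aspect. This is a generic illustration, not the exact SMAD formulation; names and sizes are assumptions.

# Hedged sketch of adversarial debiasing via gradient reversal (not SMAD itself).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lam * grad_out, None

class DebiasedClassifier(nn.Module):
    def __init__(self, feat_dim, num_actions, num_scenes, lam=1.0):
        super().__init__()
        self.lam = lam
        self.action_head = nn.Linear(feat_dim, num_actions)
        # Adversary on a non-action aspect (here: scene category).
        self.scene_head = nn.Linear(feat_dim, num_scenes)

    def forward(self, feats):
        action_logits = self.action_head(feats)
        scene_logits = self.scene_head(GradReverse.apply(feats, self.lam))
        return action_logits, scene_logits

# Both heads are trained with ordinary cross-entropy; the reversed gradient makes
# the shared features worse at predicting the scene while staying good for actions.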
Experimental study on thermal runaway risk of 18650 lithium ion battery under side-heating condition
When Urban Region Profiling Meets Large Language Models
Urban region profiling from web-sourced data is of utmost importance for
urban planning and sustainable development. We are witnessing a rising trend of
applying LLMs across various fields, especially in multi-modal research such as
vision-language learning, where the text modality serves as supplementary
information for the image. Since the textual modality has never been introduced
into modality combinations in urban region profiling, we aim to answer two
fundamental questions in this paper: i) Can the textual modality enhance urban
region profiling? ii) If so, in what ways and with regard to which aspects?
To answer the questions, we leverage the power of Large Language Models (LLMs)
and introduce the first-ever LLM-enhanced framework that integrates the
knowledge of textual modality into urban imagery profiling, named LLM-enhanced
Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP).
Specifically, it first generates a detailed textual description for each
satellite image using an open-source Image-to-Text LLM. Then, the model is trained
on the image-text pairs, seamlessly unifying natural language supervision for
urban visual representation learning, jointly with contrastive loss and
language modeling loss. Results on predicting three urban indicators in four
major Chinese metropolises demonstrate its superior performance, with an
average improvement of 6.1% on R^2 compared to the state-of-the-art methods.
Our code and the image-language dataset will be released upon paper
notification.
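A minimal sketch, assuming PyTorch, of the CLIP-style symmetric contrastive objective that pairs each satellite image with its generated description; the additional language modeling term is only indicated schematically. The temperature value and loss weighting are assumptions.

# Hedged sketch of a symmetric image-text contrastive loss (CLIP-style InfoNCE).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs lie on the diagonal; treat each row/column as a classification.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Schematic total objective for joint training, per the abstract:
#   loss = contrastive_loss(img_emb, txt_emb) + lambda_lm * language_modeling_loss
# where language_modeling_loss is a standard next-token cross-entropy on the caption.
print(contrastive_loss(torch.randn(4, 128), torch.randn(4, 128)))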
Advancements in Repetitive Action Counting: Joint-Based PoseRAC Model With Improved Performance
Repetitive counting (RepCount) is critical in various applications, such as
fitness tracking and rehabilitation. Previous methods have relied on estimation
from red-green-blue (RGB) frames and body pose landmarks to identify the number
of action repetitions, but these methods suffer from a
number of issues, including the inability to stably handle changes in camera
viewpoints, over-counting, under-counting, difficulty in distinguishing between
sub-actions, inaccuracy in recognizing salient poses, etc. In this paper, based
on the work done by [1], we integrate joint angles with body pose landmarks to
address these challenges and achieve better results than the state-of-the-art
RepCount methods, with a Mean Absolute Error (MAE) of 0.211 and an Off-By-One
(OBO) counting accuracy of 0.599 on the RepCount data set [2]. Comprehensive
experimental results demonstrate the effectiveness and robustness of our
method. Comment: 6 pages, 9 figures
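As an illustration of why joint angles are a useful signal for repetition counting, the sketch below computes the angle at a joint from three pose landmarks and counts repetitions as threshold crossings of that angle over time. It is an illustrative example, not the proposed model; the landmark layout and thresholds are assumptions.

# Illustrative sketch only: joint angle from landmarks + threshold-based rep counting.
import numpy as np

def joint_angle(a, b, c):
    """Angle at landmark b (in degrees) formed by points a-b-c, each an (x, y) pair."""
    v1 = np.asarray(a) - np.asarray(b)
    v2 = np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def count_reps(angles, low=60.0, high=150.0):
    """Count one repetition each time the angle goes below `low` and then above `high`."""
    reps, armed = 0, False
    for ang in angles:
        if ang < low:
            armed = True
        elif ang > high and armed:
            reps, armed = reps + 1, False
    return reps

# Example: a synthetic elbow-angle trace oscillating over 5 repetitions.
t = np.linspace(0, 5 * 2 * np.pi, 500)
print(count_reps(105 - 60 * np.cos(t)))  # 5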